Conversation

@avjves (Contributor) commented Nov 10, 2025

What?

Adds support for KV cache and joint tensors to the USP method.

Why?

Currently, some models, such as HunyuanVideo, have diverging code paths depending on the input parameters. The two paths (Yunchang / USP) implement their communications differently: the Yunchang path uses features of torch.distributed that are not compatible with torch.compile, whereas the USP method can be fully compiled. This PR aims to make USP support the features the Yunchang path provides, allowing us to use only USP.

This PR is the first step towards deprecating Yunchang in the long term, as discussed in #579, but does not aim to fully remove it in the short term.

How?

Ported the Yunchang features directly to the USP method. This includes the joint tensors as well as the KV cache for pipeline parallelism. Also changed HunyuanVideo and Flux to use only the USP path rather than the Yunchang path.
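As a rough illustration of what joint-tensor support amounts to (this is a hedged sketch, not the PR's actual code; the function name and shapes are illustrative): the extra "joint" key/value tensors, e.g. for text tokens, are concatenated to the main key/value along the sequence dimension, so a single attention call covers both token streams.

```python
import torch
import torch.nn.functional as F

def attention_with_joint(query, key, value, joint_key=None, joint_value=None):
    # Tensors are (batch, heads, seq, head_dim); joint K/V, if given,
    # are appended after the main sequence so one call attends to both.
    if joint_key is not None:
        key = torch.cat([key, joint_key], dim=2)
        value = torch.cat([value, joint_value], dim=2)
    return F.scaled_dot_product_attention(query, key, value)

q = torch.randn(1, 4, 16, 8)
k, v = torch.randn(1, 4, 16, 8), torch.randn(1, 4, 16, 8)
jk, jv = torch.randn(1, 4, 5, 8), torch.randn(1, 4, 5, 8)
out = attention_with_joint(q, k, v, jk, jv)
print(out.shape)  # torch.Size([1, 4, 16, 8]) — output length follows the query
```

The output keeps the query's sequence length; only the keys/values grow, which is why this composes cleanly with sequence-parallel attention.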

Tests

Output:

HunyuanVideo:

Tested both Ring and Ulysses.

HunyuanVideo already uses USP by default if the input prompt is of a specific shape. The command and output below exercise the other code path, which previously used yunchang / hybrid_seq_parallel_attn.

Run command:

torchrun --nproc_per_node=8 examples/hunyuan_video_usp_example.py \
    --model tencent/HunyuanVideo \
    --prompt "In the large cage, two puppies were wagging their tails at each other." \
    --height 720 --width 1280 --num_frames 129 \
    --num_inference_steps 50 --ulysses_degree 8 \
    --enable_tiling --enable_slicing \
    --use_torch_compile
hunyuan_test_usp.mp4

Flux:

Flux already uses standard USP by default; only in the pipeline-parallelism case did it use Yunchang:

Run command:

torchrun --nproc_per_node=8 examples/flux_example.py \
    --model black-forest-labs/FLUX.1-dev \
    --seed 42 \
    --prompt "A small cat" \
    --height 1024 --width 1024 \
    --num_inference_steps 25 \
    --max_sequence_length 256 \
    --no_use_resolution_binning \
    --ulysses_degree 2 \
    --pipefusion_parallel_degree 4
flux_test_pipeline_parallel

Perf

HunyuanVideo

Switching from the Yunchang code path to the USP code path improves performance, since torch.compile can now cover the attention call as well. We timed three consecutive runs and report the average:

Before: 193.2s
After: 188.3s

This now matches the perf of the original USP path.

Automatic tests

Also added two new unit tests that compare the outputs of the Yunchang and USP implementations.
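A minimal sketch of the kind of equivalence check such a test performs (illustrative only — the PR's real tests compare the two distributed code paths; here a hand-rolled attention stands in for one side):

```python
import torch
import torch.nn.functional as F

def reference_attention(q, k, v):
    # Fused reference implementation.
    return F.scaled_dot_product_attention(q, k, v)

def manual_attention(q, k, v):
    # Explicit softmax(QK^T / sqrt(d)) V, matching SDPA's default scaling.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 32, 8) for _ in range(3))
# Raises if the two implementations diverge beyond tolerance.
torch.testing.assert_close(reference_attention(q, k, v),
                           manual_attention(q, k, v),
                           atol=1e-4, rtol=1e-4)
```

The distributed version of such a test additionally shards the sequence across ranks and gathers the outputs before comparing.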

Other

This doesn't change the standard USP method's behaviour, so other models using USP are unaffected. Some models still use Yunchang; these would need to be changed in future PRs.

For ease of comparison, here's the original USP method:

    # No sequence parallelism: plain attention.
    if get_sequence_parallel_world_size() == 1:
        out = _attention(query, key, value, dropout_p=dropout_p, is_causal=is_causal)
    # Ring-only: Ulysses degree is 1, so no all-to-alls are needed.
    elif get_ulysses_parallel_world_size() == 1:
        out = ring_attn(query, key, value, dropout_p=dropout_p, is_causal=is_causal)
    elif get_ulysses_parallel_world_size() > 1:
        # Ulysses: all-to-all re-shards Q/K/V from the sequence dim to the head dim.
        query = _ft_c_input_all_to_all(query)
        key = _ft_c_input_all_to_all(key)
        value = _ft_c_input_all_to_all(value)

        if get_ring_parallel_world_size() == 1:
            out = _attention(query, key, value, dropout_p=dropout_p, is_causal=is_causal)
        else:
            out = ring_attn(query, key, value, dropout_p=dropout_p, is_causal=is_causal)

        # All-to-all back: re-shard the output from heads to sequence.
        out = _ft_c_output_all_to_all(out)

    return out
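For context, a hedged sketch of what per-layer KV caching for pipeline parallelism generally looks like (names and shapes are illustrative, not the PR's API): each pipeline patch computes K/V only for its slice of the sequence and appends it to a cache, so attention can run against the accumulated full-length K/V.

```python
import torch

class KVCache:
    """Illustrative per-layer cache; not the actual xFuser class."""

    def __init__(self):
        self.key = None
        self.value = None

    def update(self, key, value):
        # Append along the sequence dimension (dim=2 for batch, heads, seq, dim)
        # and return the full-length tensors for the attention call.
        if self.key is None:
            self.key, self.value = key, value
        else:
            self.key = torch.cat([self.key, key], dim=2)
            self.value = torch.cat([self.value, value], dim=2)
        return self.key, self.value

cache = KVCache()
k1 = torch.randn(1, 4, 8, 16)
full_k, _ = cache.update(k1, torch.randn_like(k1))   # first patch
k2 = torch.randn(1, 4, 8, 16)
full_k, _ = cache.update(k2, torch.randn_like(k2))   # second patch
print(full_k.shape)  # torch.Size([1, 4, 16, 16])
```

Folding this kind of cache update into the USP function above (rather than into the Yunchang wrapper) is what lets the whole attention call stay on the compilable path.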

@feifeibear (Collaborator) left a comment:
LGTM

@feifeibear (Collaborator) commented:

/gemini review

@gemini-code-assist (bot) left a comment:

Code Review

This pull request is a great step towards unifying attention mechanisms by adding joint tensor and KV-cache support to the USP method. The goal of deprecating the yunchang path to improve torch.compile compatibility is well-motivated. The changes in xfuser/model_executor/layers/usp.py are substantial and well-supported by new unit tests that verify equivalence with the old implementation. The modifications in attention_processor.py and transformer_flux.py to adopt the new USP interface are consistent and correct. Overall, the changes are well-executed. I have one suggestion to improve the structure of the new tests for better maintainability.

@jcaraban jcaraban self-assigned this Nov 11, 2025
@jcaraban jcaraban self-requested a review November 11, 2025 19:22
@jcaraban (Collaborator) left a comment:
LGTM

@feifeibear feifeibear merged commit b8ebdf7 into xdit-project:main Nov 12, 2025
@jcaraban jcaraban mentioned this pull request Nov 12, 2025